Skip to content

spike(workflow): concurrent dynamic dispatch harness (DynamicNodeSupervisor + ctx.pipeline)#3

Draft
caohy1988 wants to merge 64 commits into
mainfrom
spike/dynamic-supervisor-concurrency
Draft

spike(workflow): concurrent dynamic dispatch harness (DynamicNodeSupervisor + ctx.pipeline)#3
caohy1988 wants to merge 64 commits into
mainfrom
spike/dynamic-supervisor-concurrency

Conversation

@caohy1988

@caohy1988 caohy1988 commented Jun 1, 2026

Copy link
Copy Markdown
Owner

Draft / staging PR (fork-internal), rebased onto current main. Four RFC spike artifacts under contributing/samples/workflows/. Not the upstream PR to google/adk-python.
Diff is now purely additive samples — 4 directories under contributing/samples/workflows/ plus a .gitignore entry (8,740 insertions, 0 deletions), rebased onto current main. The earlier one-line src/ fix to workflow/_llm_agent_wrapper.py::prepare_llm_agent_context (carry the dispatch's isolation_scope into the agent context) has been dropped as redundant: upstream main now propagates isolation_scope in prepare_llm_agent_context itself (_llm_agent_wrapper.py:210,217, commit f39d75b9 / PiperOrigin-RevId 933314484), so the patch is no longer needed. The bug the samples originally uncovered — a fanned-out single_turn agent making MULTIPLE model calls (tool loop) rebuilding its context from the LATEST sibling's unscoped input, observed live as parallel tool-using skeptics all answering the last claim — is therefore fixed upstream. Isolation now rests on that upstream fix plus the supervisor giving every dispatch its own sub-branch + per-dispatch isolation scope; the deterministic spy-driven regression test (no network) still simulates the tool loop and proves siblings don't swap inputs (it now passes against unmodified upstream). Four new sample directories:

1. dynamic_supervisor_spike/ (co9

Prototype DynamicNodeSupervisor (gate-on-leaf, TaskGroup fan-out) + ctx.pipeline/ctx.parallel on the real ADK engine.

  • supervisor.py, test_dynamic_supervisor_spike.py (11 deterministic tests), test_live_gemini_e2e.py (env-gated), README.md.
  • Barrier-free execution, failed-item isolation, control-exception cancellation (requires TaskGroup, not gather), nested no-deadlock with leaf gating (+ driver-gating deadlock contrast), resume exactly-once for completed children.

2. authored_workflow_spike/ (agent-authored typed Workflows)

A model emits a declarative, validated WorkflowSpec (typed data, not code) executed on the real engine via the supervisor.

  • authoring.py (WorkflowSpec plain kind-tagged-union tree, CapabilityRegistry, WorkflowSpecValidator incl. plan-quality lints, SpecInterpreter for step/fan_out/pipeline/branch/loop_until with loop-carried LoopUntil.init + dispatch_count, plus FrozenWorkflowRecord/export_plan/import_plan, derived contract-hash drift (primary signal, fail-closed on stripped hashes; manual versions secondary), auditable lint waivers (allow_self_chain policy + per-plan waivers recorded in the frozen record), and independence_facts), test_authoring.py (36 deterministic tests, incl. a barrier-free pipeline proof, per-stage max_fan_out enforcement, plan export/import round-trip + tamper / capability+registry version drift / schema_version checks, ADK-config lowering of the static subset, six-coordination-pattern coverage — adversarial verification + tournament — and the plan-quality lints), test_live_planner_sweep.py (env-gated; multi-stage/branch/loop on gemini-3.5-flash), DESIGN.md (canonical technical design incl. plan export/storage tiers), README.md.
  • Findings folded into Tutorial: Updated call_agent_async function not provided google/adk-python#93: open-dict maps are a structured-output hazard (Branch.routeslist[Route]); Gemini response_schema rejects Field(discriminator=...) so the vocabulary is a plain kind-tagged union; planning vs capability quality are separable; the pattern sweep surfaced a vocabulary gap — tournament needs loop-carried state, hence LoopUntil.init (binding the loop's own id without init is a validation error).
  • Independence is statically checkable (the typed-bindings advantage over model-written orchestration code): the validator lints same-capability self-review (self-preferential bias) and unsynthesized fan-out; independence_facts() derives the provenance statements ("stage verifier sees ONLY stage reviewer's per-item output") the frozen record proves to an auditor.
  • Convergence with ADK Workflow config / root_agent.yaml (DESIGN §11, per reviewer feedback on Tutorial: Updated call_agent_async function not provided google/adk-python#93): static graph skeletons should lower/export toward the contributing/samples/workflows/loop_config/root_agent.yaml style (agent_class: Workflow, static edges, child YAML), while WorkflowSpec remains the model-facing source of truth. Raw YAML is not the planner output because loop_config intentionally resolves Python function refs (.agent.route_headline), _code refs, child config paths, tools/callbacks, and possibly FQNs. branch/runtime fan_out/pipeline stay new typed blocks because config does not directly express runtime per-item dispatch / barrier-free multi-stage flow; YAML would need a wrapper node. Caveat: Workflow itself is not deprecated, but the current config loader path and agent-config sugar classes are @deprecated + @experimental; this is convergence with the Workflow config shape for compatibility, not a long-term dependency on today's loader or deprecated sugar.

3. authored_workflow_demo/ — ADK Web demo wrapper

A discoverable root_agent (a Workflow) that exposes the flow in ADK Web — authors (scripted or free-form — the freely/decompose trigger gives the planner only goal + capability descriptions, no recipe) → validates → independence-lints → freezes-to-state → exports → lowers static config shape → executes with a cost line (capability dispatches + planner calls), surfacing each step as a chat message (authored plan, validation, independence facts, capabilities, frozen hash, exported plan, config projection, output, cost).

  • security_audit_planner/agent.py, test_demo_agent.py (8 CI-safe tests, incl. a no-LLM reuse-path test + dispatch-count assertion, the lints/independence-facts check, the quality-gate (self-review rejection) shape pin, the recipe-free free-authoring instruction pin, and the config-lowering assertion), README.md (the ~7-min recording script), DEMO_NARRATIVE.md (beat-by-beat narration from a real run).
  • Load-or-author: an existing frozen spec is reused (planner not re-invoked) — the resume/reproducibility claim is real and CI-verified.
  • Run: adk web contributing/samples/workflows/authored_workflow_demo.

4. authored_workflow_ca_demo/ — BigQuery Conversational Analytics planner

One agent, seven prompts, seven authored shapes — the pattern-coverage table made tangible in a real product surface (styled after BigQuery Conversational Analytics). Query execution is REAL BigQuery when credentials allow: dry_run hits the actual dry-run API (real errors + bytes-scanned, with the user question preserved through the loop-carried value so repair rounds get full context) and run_query executes against bigquery-public-data.thelook_ecommerce (billed to GOOGLE_CLOUD_PROJECT; maximum_bytes_billed = 2 GB/query, 500-row cap; a failing query returns the real error — never fabricated fallback rows). Without credentials, a deterministic micro-warehouse (synthetic facts + SQL-intent aggregation incl. month/quarter/year grains) takes over; every beat carries an engine field. The default ask-a-question flow is the real CA shape: loop_until(draft → real dry-run → repair from the real error) → run_query → render_chart + summarize, with the live question as task input (template reuse on re-send). Charts: Vega-Lite artifact + inline matplotlib PNG, line inference for time series, multi-series (one line per region) for two-dimensional results. Cross-session reuse: frozen plans export their full FrozenWorkflowRecord to a plan store and new sessions import them through the defensive path (hash/registry/declared-contract checks; new questions validated against the captured task_input_schema); the typed object-output capabilities declare output models so contract hashes have teeth (primitive helpers — sql_ok, pair_charts, judge_chart, single_chart — return bare values and rely on manual versions).

  • bq_ca_planner/agent.py, test_ca_demo_agent.py (38 collected — 36 CI-safe + 1 live-gated real-BigQuery round-trip + 1 tool-loop sibling-isolation proof (now runs against unmodified upstream — isolation_scope is fixed in main): all seven shapes validated + lint-clean + executed end-to-end with language capabilities stubbed, engine grains/windows/filters, qualify/jsonify, no-fabrication contract, question-preservation through failed dry runs, multi-series chart pivot, cross-session store round-trip/tamper/drift, the conversational intent gate incl. a no-LLM end-to-end escape-path test, live-insight audits with last-insight memory, the DATA-GROUNDED skeptic (real query tool + output_schema, v2 contract bump), runtime isolation proofs — single-call and tool-loop — and skeptic verdicts with reasons), README.md (prompt table + the 10-turn mixed-workflow conversation script).
  • Run: adk web contributing/samples/workflows/authored_workflow_ca_demo --port 8001.

Hygiene

pyink / isort / mdformat / pre-commit clean. Deterministic suites: 11 + 36 + 8 + 38 = 93 collected (gated skips in CI as noted) green. Live tests env-gated (skip without SPIKE_LIVE + project). The fork-only agent-triage-pull-request check fails on an empty GITHUB_TOKEN (non-actionable; won't occur upstream).

@caohy1988 caohy1988 force-pushed the spike/dynamic-supervisor-concurrency branch 6 times, most recently from 8e8905f to fd3bd52 Compare June 2, 2026 07:27
caohy1988 added a commit that referenced this pull request Jun 8, 2026
Caught in review of #6: the C7 pair keys
(pause_kind, function_call_id) were being passed via
EventData.extra_attributes, which _enrich_attributes() copies at the
top of attrs *before* attrs["adk"] = _build_adk_envelope(...). That
landed them at attributes.pause_kind / attributes.function_call_id,
not attributes.adk.pause_kind / attributes.adk.function_call_id.

The customer SQL pinned in google#293 v5 acceptance #3 is:

  JSON_VALUE(attributes, '$.adk.function_call_id') = JSON_VALUE(...)

so the pair join would have returned null on every row. This commit
makes the contract match the SQL.

Changes:
* EventData gains adk_extras: dict[str, Any], a sibling of
  extra_attributes that lives INSIDE attributes.adk.
* _enrich_attributes merges adk_extras into the envelope after
  _build_adk_envelope (envelope wins on conflict — producer-derived
  identity fields like source_event_id are the source of truth).
* The two emit sites (TOOL_PAUSED in on_event_callback,
  TOOL_COMPLETED in on_user_message_callback) pass the pair keys via
  adk_extras= instead of extra_attributes=.
* The three C7 tests are updated to assert
  json.loads(row["attributes"])["adk"]["pause_kind"] etc., locking
  in the right shape this time.

Full plugin suite: 252 passed.
caohy1988 added a commit that referenced this pull request Jun 9, 2026
…ee-authoring beat

Implements the three maintainer-review items that strengthen gate evidence
(the rest stay RFC-design-level):

* Contract-hash drift (review #3): Capability.contract_hash() =
  sha256(input_kind + output schema), recorded per referenced capability in
  FrozenWorkflowRecord; import rejects contract drift even when the manual
  version string was never bumped (manual versions demoted to secondary).
* Auditable lint suppression (review #6): allow_self_chain capability policy
  opts legitimate draft->critique->redraft refinement out of the self-review
  lint; per-plan lint_waivers (node id -> justification) suppress lints AND
  are recorded in the frozen record/export envelope.
* Free-authoring beat (demo review): 'freely/decompose' trigger gives the
  planner ONLY goal + capability descriptions (no recipe) — the honest
  model-authored claim; recipe-free instruction pinned by a CI test.

Suites: 11 + 34 + 8 = 53 deterministic tests.
caohy1988 added 19 commits June 24, 2026 14:01
…rvisor + ctx.pipeline)

RFC google#92 reference harness. DynamicNodeSupervisor (gate-on-leaf, TaskGroup
fan-out) + ctx.pipeline/ctx.parallel on the real ADK Workflow engine.
11 deterministic CI-safe tests (no LLM) + an env-gated live Gemini E2E.
Proves barrier-free execution, failed-item isolation, control-exception
cancellation (requires TaskGroup, not gather), nested no-deadlock with leaf
gating (+ driver-gating deadlock contrast), and resume exactly-once for
completed children. pyink/isort/mdformat clean.
A model emits a declarative, validated WorkflowSpec (typed data, not code) that
the framework validates and executes on the real ADK engine via the google#92
supervisor. authoring.py (WorkflowSpec plain kind-tagged recursive union;
CapabilityRegistry; WorkflowSpecValidator; SpecInterpreter for step/fan_out/
branch/loop_until), 10 deterministic tests, env-gated live planner sweep
(multi-stage/branch/loop on gemini-3.5-flash, shape-specific assertions).

Findings folded into the RFC: open dict[str,X] maps are a structured-output
hazard (Branch.routes -> list[Route]); Gemini response_schema rejects
Field(discriminator=...), so the vocabulary is a plain kind-tagged union;
planning vs capability quality are separable. pyink/isort/mdformat clean.
A discoverable `root_agent` (a Workflow) that exposes the google#93 flow in ADK Web:
the model authors a typed WorkflowSpec, ADK validates it against the capability
registry, freezes spec+hash to session state, and executes it on the real
engine via the google#92 supervisor — each step surfaced as a chat message so the
authored plan, validation, capabilities, frozen hash, and final output are all
visible in the ADK Web chat / State / Events surfaces.

  adk web contributing/samples/workflows/authored_workflow_demo

Load-or-author: if a frozen spec already exists in the session it is REUSED
(planner not re-invoked) and replayed — so the resume/reproducibility claim is
real, verified by a CI-safe no-LLM test (test_demo_agent.py: import + name +
registry + spec validation + reuse-path-with-stub-registry). Reuses the
authored_workflow_spike/ stack; model from env (default gemini-2.5-flash;
gemini-3.5-flash needs location=global); no hardcoded project. README is the
~7-min recording script. pyink/isort/mdformat clean; 4 demo tests pass.
…rage tiers)

Standalone design for the authored-workflow spike: data model, validator,
interpreter, frozen-spec contract, security model, testing, empirical findings,
and the plan export/storage tiers (v1 per-run persist; v1.1 portable JSON
export envelope; v2 reusable templates with import-time registry revalidation;
compiled Workflow is a derived artifact, never the stored source of truth).
…a, import-input rule

DESIGN.md §10: spec_hash/task_input_digest defined as sha256 over canonical JSON;
envelope carries an optional task_input_schema; import contract — digest is
advisory provenance for replaying the original run, template reuse validates a
new task input against task_input_schema, else replay-only on matching digest or
explicit template promotion. Never silently bind a stored plan to an
incompatible task shape.
One FrozenWorkflowRecord backs session state, audit event, and export envelope
(§5/§10) — v1 persists the full record under authored_workflow:frozen_record,
not a weaker {spec,hash} subset. import_plan recomputes spec_hash (reject on
mismatch) and re-runs validation against the current registry rather than
trusting envelope.validation; replay vs template execution-input rule made
explicit. 'discriminated-by-kind' -> 'plain kind-tagged union' wording fix.
Demo persists only {spec, hash}; production v1 stores the full
FrozenWorkflowRecord (DESIGN.md §5). State it explicitly in DESIGN.md, the demo
agent.py freeze step, and DEMO_NARRATIVE.md so the demo isn't misread as the
canonical persistence contract.
…aude Code comparison

Pipeline/PipelineStage make barrier-free multi-stage per-item flow first-class so
the authoring vocabulary is not less expressive than its google#92 executor. Add a
candid 'Comparison to Claude Code Dynamic Workflows' (wins: audit/safety; gaps:
expressiveness/maturity, plan-size ceiling, quality-pattern templates, scale).
Hierarchical/sub-plan authoring noted as post-gate future, not MVP.
authoring.py now covers step/fan_out/pipeline/branch/loop_until. Pipeline +
PipelineStage compile to google#92 ctx.pipeline: each item threads ALL stages
barrier-free (item A in stage k while item B in stage 1) — NOT two barriered
fan_outs. Validator: over is a list, every stage capability exists and takes an
item, stage input scope. Interpreter: stage[0] input defaults to the per-item
element, stage[n] to stage[n-1] output; collect=list.

3 new deterministic tests (13 total): validator accept/reject pipeline, and an
ordered + BARRIER-FREE proof (a verifier starts before the slow reviewer
finishes — impossible with two barriered fan_outs).
P2: each Pipeline stage dispatches once per item, so every stage capability is
subject to the same data-dependent fan_out cap as FanOut. The interpreter now
rejects (pre-dispatch) when len(items) > stage cap.max_fan_out, closing a gap
where a pipeline over N items bypassed a capability's cap — the RFC security
model relies on runtime enforcement of these caps. New deterministic test
(14 total) asserts rejection before any stage runs.

P3: sync stale shape lists (add pipeline) and test counts after the prior
Pipeline addition — authoring 10/13 -> 14, totals 11+14+4 = 29, across
authoring.py, test_authoring.py, both READMEs, DESIGN.md, DEMO_NARRATIVE.md.
…y-audit demo

The demo now exercises Pipeline — the construct that closes the Claude Code gap
— without adding visual complexity. The planner authors pipeline -> step ->
step: a reviewer->verifier pipeline over the files (each file reviewed then its
finding verified, barrier-free per item), then triager, then formatter.

- agent.py: add a 'verifier' capability (Finding -> confirmed Finding); planner
  instruction authors the pipeline; capability collection walks pipeline stages
  so the displayed/audited capability set includes stage caps.
- test_demo_agent.py: demo spec + stub registry use Pipeline/verifier; reuse
  path still proves no-LLM frozen replay.
- Narrative + README: Beat 1 plan is pipeline -> step -> step; Beat 4 calls out
  reviewer/verifier interleaving per file; fresh captured hash 1f4c0883beb6.

Validated live on gemini-3.5-flash: planner authored the pipeline 3/3 trials
(no flakiness); reuse replays the same hash without re-invoking the model.
…e registry

The first chat message hardcoded (reviewer, triager, formatter) and so
contradicted the validation list after verifier was added. Derive it from
CapabilityRegistry.names() (new accessor) so the recording can't drift from the
registered set again.
…plan + demo beat

Makes the frozen plan a first-class, portable artifact (DESIGN.md §10), so the
RFC's enterprise claim (reviewable / diffable / replayable model-authored plans)
is real, not paper.

authoring.py:
- FrozenWorkflowRecord (the single §5 shape) + ValidationResult; FrozenWorkflowRecord.freeze() captures spec_hash, planner_model, registry + per-capability versions, validation, task_input_digest.
- canonical_json / sha256_hex — the one fixed hash definition (sort_keys + tight separators) so two exporters agree.
- export_plan(record) -> dict; import_plan(envelope, registry, task_input=None) that NEVER trusts the envelope: recompute sha256 (reject tamper), re-validate vs CURRENT registry (reject dropped capability), reject per-capability version drift; execution-input contract (replay needs matching input digest; template needs task_input_schema).
- Capability.version + CapabilityRegistry.capability_versions() for drift detection; referenced_capabilities() walker.

test_authoring.py (+5 -> 19): round-trip replays same hash; tamper rejected; dropped capability rejected; version drift rejected; new input without template schema rejected (and accepted once a schema is attached).

demo: an 'Export plan' beat writes the full envelope to security_audit_plan.json and re-imports it (proving defensive import). Unifies the displayed hash on the canonical definition. Narrative/README gain Beat 3b; counts 11+19+4 = 34. Generated envelope + demo session dbs gitignored.

Validated live on gemini-3.5-flash: export beat writes a complete envelope, re-import passes, reuse replays the same hash.
…sort

P1: isort ordering in test_authoring.py imports (PlanImportError after
PipelineStage) — pre-commit clean.

P2: import_plan now hard-errors on registry_version drift (envelope vs current
registry.version), matching DESIGN.md §10 'registry-version match … drift = hard
error'. Previously only dropped capabilities + per-capability versions were
checked.

P3: import_plan rejects an unsupported schema_version (only 'v1' supported) — a
defensive importer refuses formats it can't read.

+2 deterministic tests (-> 21): registry-version drift and unsupported
schema_version both rejected. Counts 11+21+4 = 36.
…bservability

Adds DESIGN §11 'Convergence with ADK AgentConfig' (renumbers Future -> §12) in
response to reviewer questions on issue google#93:

- Lower the static subset (sequence/parallel/loop) to ADK's Sequential/Parallel/
  LoopAgentConfig instead of reinventing serialization; keep branch/fan_out/
  pipeline as new types only because config can't express them (static sub_agents
  resolved once at load; no ConditionalAgent; needs google#92 ctx.pipeline).
- Why the planner does NOT emit raw AgentConfig: static graph; Discriminator union
  rejected by response_schema; FQN tool/agent/callback refs (importlib, no
  allow-list) re-open the code-exec surface the declarative+allow-list model closes.
- Q1 storage: FrozenWorkflowRecord in session State + audit event + export envelope.
- Q2 custom tools: registered capability by registry name (allow-list), not FQN.
- Q3 version/observability: spec_hash + registry/capability versions -> drift
  rejected on import; compiled Workflow runs on the real engine so ADK tracing applies.
All claims source-verified against agents/agent_config.py and config_agent_utils.py.
Address review on the convergence section:
- 'design converges / should lower', not 'now lowers' — the spike does not yet
  implement an AgentConfig-lowering compiler (explicit caveat added).
- precise table: static parallel block -> ParallelAgentConfig; runtime
  fan_out/pipeline/branch have no direct config equivalent (ParallelAgentConfig
  is static parallel sub-agents, not data-mapping over a runtime list).
- soften FQN wording to a trust-boundary mismatch (FQN imports are fine for
  developer-authored config; the concern is a MODEL authoring raw FQNs), not
  'config is unsafe'.
Tie the demo to RFC google#93 §11 without overclaiming: this plan's top-level
sequence is the kind of static shape that should lower to SequentialAgentConfig,
while the reviewer->verifier pipeline (per-item over a runtime list) is exactly
what AgentConfig can't express. Explicitly notes this is a design direction —
the demo runs via SpecInterpreter and does NOT lower to AgentConfig (no such
compiler in the spike). README section + a presenter aside in the narrative.
caohy1988 added 29 commits June 24, 2026 14:01
…allback

'give me the best chart for revenue by region' matched the sequence trigger
'revenue by region' before the tournament trigger 'best chart' (first-match
in definition order) and replayed the frozen Q&A plan. The sequence scenario
is the generic fallback for ANY question, so the router now checks all
specialized scenarios first and falls back to sequence only when none
match. Overlap cases pinned in the routing test (incl. the advertised
prompt 7, which had the same latent collision).
…ifacts

BigQuery Conversational Analytics returns Vega-Lite chart specs; the demo
now does too. New deterministic render_chart capability (no plotting deps):
input-tolerant (query rows dict, raw rows list, tournament winner list, or a
bare chart-type string) -> chart_type + Unicode bar preview (renders in the
ADK Web chat) + Vega-Lite spec. Wired into the sequence recipe (chart the
query rows) and the tournament recipe (render the data with the winning
mark); a 📈 beat surfaces any chart artifact found in interpreter state.
15 -> 16 CA tests, 71 total.
…ne) + inline chart images

Two upgrades from live dogfooding:

* run_query is now a deterministic micro-warehouse: a synthetic 24-month x
  4-region x 4-category fact table + SQL-intent parsing. The executor
  AGGREGATES the facts per the query's grouping (month/region/category),
  window (INTERVAL N YEAR/QUARTER/MONTH), filters (country/region/category
  literals), and measure alias (SUM(...) AS name) — replacing the canned
  keyword rows whose shape couldn't answer a trend question (live gap: a
  monthly-trend SQL got region rows back). Honest scope documented: intent
  execution, not SQL parsing; real BigQuery is the production step.
* render_chart infers a LINE mark for date-shaped x labels (explicit
  tournament winners still win); _chart_png renders the artifact to PNG via
  matplotlib (optional) and the 📈 beat emits it as an inline image part —
  ADK Web shows a real chart; falls back to the Unicode preview.

18 -> 21 CA tests (engine windows/trend-alias-filter/total/category,
line inference, PNG magic bytes, derived encoding); 76 total.
…s for all time buckets

Live gap #2: 'sales trend for global based on 3 years' authored
EXTRACT(YEAR ...) AS year GROUP BY year — the engine only knew monthly
grouping, fell through to a single anonymous grand total, and charted one
'?' bar. Fixes:

* Time grains month/quarter/year (week -> month) detected from
  DATE_TRUNC(..., G), EXTRACT(G FROM ...), AS g aliases, and the GROUP BY
  clause — scoped to the actual clause (stop at ORDER BY/LIMIT, INTERVAL
  phrases stripped, so a trailing 'INTERVAL 1 YEAR' window never reads as a
  yearly grouping).
* Monthly facts bucket into the requested grain (quarters as YYYY-Qn,
  years as YYYY); quarters sum exactly to their year.
* Line-mark inference covers monthly, quarterly, and yearly labels
  (>= 2 points; a single total stays a bar; explicit winners still win).

Pinned: the exact live yearly SQL, quarterly buckets, bucket-consistency,
label inference. 21 -> 22 CA tests, 77 total.
…t + multi-series charts

Replaces semantic guessing with the real thing, per review feedback:

* dry_run -> the actual BigQuery dry-run API (real errors, real
  bytes-scanned); run_query -> real execution against
  bigquery-public-data.thelook_ecommerce billed to GOOGLE_CLOUD_PROJECT,
  with safety rails: maximum_bytes_billed = 2 GB/query, 500-row cap,
  result cells JSON-ified (Decimal/date/datetime). Bare table refs are
  auto-qualified and backticked. Fallback to the deterministic
  micro-warehouse without credentials or with CA_DEMO_USE_BIGQUERY=0;
  every beat carries an engine field (bigquery/mock/mock-fallback) so the
  data source is never misrepresented.
* Multi-series charts: x/series/measure derived from the result shape
  (time fields by name/value, a second categorical becomes one line per
  value, measures picked by name preference — an int year column is never
  mistaken for the measure); Vega-Lite color encoding + matplotlib
  multi-line PNG with legend; ascii preview field-aware and capped.

22 -> 27 CA tests (qualify, jsonify, no-credentials fallback, multi-series
pivot, and a live-gated real-BigQuery round-trip — dry-run bytes, real
error, real rows; passing locally). 81 total.
…rrors; no fabricated fallback

Live finding: nl2sql emitted TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5
YEAR) — a genuine BigQuery dialect error the real dry-run caught — but (a)
the linear sequence plan ignored the verdict and ran the query anyway, and
(b) the executor's no-credentials fallback fired on the FAILING query and
served mock rows as an answer.

* The ask-a-question recipe is now the real CA flow: loop_until(
  draft_or_repair_sql reading the loop-carried failed dry-run output ->
  REAL dry_run) -> run_query -> chart + summarize. The model repairs from
  the actual BigQuery error text.
* _execute_sql no longer fabricates: a failing query returns empty rows +
  the real error (engine: bigquery); the mock engine is reserved for
  missing credentials only.
* Tests: expected sequence shape updated; no-fabrication contract pinned
  with a raising fake client; plan-executing e2e tests pin the mock engine
  (no network in unit tests). 27 -> 28 CA tests, 82 total.
Frozen plans now outlive the session. On freeze, the FULL
FrozenWorkflowRecord (spec, hashes, planner/registry/capability versions +
contract hashes, captured task_input_schema = template promotion) is
exported to ca_plan_store/<scenario>.json — the demo's stand-in for the
ArtifactService in the RFC's revised Q1. A new session's reuse order is:
session state -> plan store -> author fresh. The store path uses the RFC's
DEFENSIVE import: spec_hash recomputed, re-validation against the CURRENT
registry, manual-version + contract-hash drift are hard rejections (shown
in chat, then fresh authoring — drift never silently replays a stale
plan), and a new question is validated against the captured schema
(cross-session template reuse).

Tests: store round-trip + template reuse with a new question, tamper
(hash-mismatch) rejection, contract-drift rejection with an unbumped
version. 28 -> 30 CA tests, 84 total.
…gh repair loop, declared-output contracts

1. Blocker: isort import order in agent.py (pre-commit now clean locally).
2. High: the repair-round claim is now real — Sql echoes the question
   (schema field + instruction) and _bq_dry_run preserves it through every
   branch, so after a FAILED real dry run the loop-carried value holds
   question + sql + error, not sql + error. Pinned with a fake-client
   failure test asserting the question survives.
3. Medium: deterministic capabilities (dry_run, run_query, render_chart,
   profile_table, quality_report, describe_schema, keep_verified, flaky)
   now declare output models, so their contract hashes cover real output
   schemas; plan-store wording softened to declared-contract drift.
4. Medium: PR body refreshed (real BigQuery, plan store, counts).

30 -> 31 CA tests, 85 total.
…n manual versions

Review precision item: sql_ok / pair_charts / judge_chart / single_chart
return bare bool/str/list values (wrapping them in declared models would
change runtime shapes their consumers depend on), so their contract hashes
cover input_kind only. README and PR wording now scope the claim to the
typed object-output capabilities; test-count wording fixed (31 collected =
30 CI-safe + 1 live-gated).
…rectly, no workflow

Live finding: 'tell what kinds of workflow you can issue?' fell through the
trigger router to the data scenario; the frozen plan replayed and NL2SQL
turned a question about the AGENT into a query about order statuses.

The root agent now has a front door — the RFC's no-plan escape hatch
(DESIGN §12) implemented: untriggered messages are intent-classified
(data | meta | chat). Meta/chat turns get a direct conversational reply (a
catalogue of the seven workflow shapes, built from SCENARIOS so it never
drifts) at the cost of one intent call — no planner, no queries. Data
questions and trigger prompts proceed unchanged. Gate instruction and
catalogue are template-safe (brace-free), pinned by test alongside the
routing split (triggered bypasses gate; the exact live meta-question gates).

31 -> 32 CA tests, 86 total.
Review items: (1) pin the FULL no-workflow escape path — with a stubbed
intent gate returning meta, the root agent's run yields the gate reply and
the conversation output and returns BEFORE plan-store import, session
replay, authoring, or execution (empty plan store asserted untouched; no
authored/reused/validation beats in the transcript). (2) PR body counts
synced (33 collected = 32 CI-safe + 1 live-gated; 88 total collected).
…rated, then canned

Live finding: 'audit this insight <X>' selected the adversarial mode but
audited the CANNED demo insights — the user's claim was discarded (mode
selectors kept canned task inputs). The audit scenario now resolves its
insights in order: (1) inlined in the message ('audit this insight: X',
';'/newline-separated lists, typo-tolerant filler), (2) the session's last
generated insight ('audit that insight' after a question — each sequence
result is remembered in state), (3) the canned demo set as final fallback.
The banner states which source is being audited. Frozen plan unchanged —
new insights are template reuse through the same fan_out(skeptic) plan.

33 -> 34 CA tests, 88 total.
…dit beat

Live finding: the skeptic step was a raw event blob with a bare
(insight, refuted) pair — no reasoning, easy to miss, and with one insight
only one skeptic runs so there was 'nothing to see'.

* Verdict gains a required-by-instruction reason field — the skeptic must
  show its work (what it checked, why the claim stands or falls, caveats
  like partial years).
* New audit beat renders every verdict found in interpreter state as one
  line: REFUTED / upheld — insight — reason.
* _verdict_of carries the reason (tolerant of JSON-string verdicts);
  insight extraction strips trailing punctuation from user phrasing.

34 -> 35 CA tests, 89 total.
…insight

User question: does the skeptic fan-out really dispatch isolated agents, or
do they read the conversation history? Settled empirically: a spy on the
model layer (before_model_callback short-circuits, no network) captures
each fanned-out skeptic's actual LLM request. Asserts: one real dispatch
per insight; each request contains exactly its own insight — not the
sibling's insight, not a planted prior chat beat, not the user's turn
message. ADK's workflow wrapper runs plain agents in single_turn mode with
per-dispatch isolation scoping, so the model call gets node_input only.

This is the runtime half of the independence story (the binding lints are
the static half). 35 -> 36 CA tests, 90 total.
…olation (tool-loop fix)

The data-grounded skeptic (a real BigQuery verification tool on the
adversarial reviewer) exposed a REAL isolation bug, fixed at the source:

* fix(src/workflow): prepare_llm_agent_context dropped the dispatch's
  isolation_scope — Context only inherits it via parent_ctx, which is not
  passed — so the single_turn input event was appended UNSCOPED. A
  fanned-out agent making MULTIPLE model calls (tool loop) rebuilds its
  contents per call and picked up the LATEST sibling's input instead of
  its own: parallel tool-using skeptics all answered the last claim
  (single-call agents were correct only by timing). The agent context now
  carries the scope; verified offline (spy-driven simulated tool loop) and
  live (per-claim verdicts with real queried evidence).
* spike(google#92 supervisor): every dispatch now runs in its own sub-branch and
  its own isolation scope (parent_scope::run_id) — per-dispatch context
  independence is structural for multi-call children.
* feat(ca-demo): the skeptic is DATA-GROUNDED — output_schema + a real
  query_thelook tool (capped, engine-honest); instruction demands queried
  evidence in the verdict reason; capability version bumped to 2 so stored
  plans drift-reject instead of silently reusing the plausibility-only
  skeptic. Live verification: the $1M-AOV claim is REFUTED with the actual
  computed ~$86 AOV; the Shipped-status claim upheld with real counts.
* tests: tool-loop sibling-isolation regression (deterministic spy FC, no
  network; gated on the patched wrapper so stale local installs skip);
  single-call isolation assertions made version-tolerant; grounded-skeptic
  config + tool pins. CA suite 38 collected; full suite 92 under the
  patched tree, 91+skip under a stale ADK.
…tated HTML page

Reads every FrozenWorkflowRecord envelope in ca_plan_store/ and writes a
self-contained plan_inspector.html: the plan's dataflow as a diagram and
every envelope field annotated with the guarantee it delivers — spec_hash
(tamper evidence), planner_model/created_at (authoring provenance),
registry + capability versions and derived contract hashes (drift
detection), task_input_schema/digest (cross-session template reuse),
validation (recorded, never trusted). The on-camera artifact for the
"plans as durable, auditable data" story.
The banner still said 'over mock thelook_ecommerce' — stale from before
the real-BigQuery upgrade and misleading on camera. It now reports LIVE
bigquery-public-data.thelook_ecommerce when credentials allow, or the mock
warehouse fallback, matching the engine field on every result.
…ES__ metadata

Live finding: the demo listed only 4 of thelook_ecommerce's 7 tables, and
profile_table still returned CANNED values chosen to look plausible —
exactly the kind of fabrication the engine-honesty rule exists to prevent.

* TABLES now covers the full real dataset (adds events, inventory_items,
  distribution_centers) — also widens nl2sql coverage to event/inventory
  questions.
* profile_table queries the real __TABLES__ metadata (row_count, size_mb;
  cheap metadata reads) with engine=bigquery, falling back to clearly
  labeled canned values without credentials. Live-verified: events 2.43M
  rows / 368MB, distribution_centers 10 rows.
* TableProfile/QualityReport contracts updated (row_count + size_mb;
  report = totals + largest table) — a DELIBERATE contract change: stored
  fanout plans drift-reject and re-author rather than running stale
  semantics. Fanout e2e pins all seven tables via the mock engine.
…ve path

User ask: make the whole demo run on the real thelook_ecommerce dataset.
Remaining fakery removed:

* describe_schema v2: a data-grounded agent (query_thelook tool) that
  answers metadata questions by querying the REAL data (DISTINCT values,
  counts) — live-verified answer cites shipped_at/delivered_at semantics
  it discovered itself. The canned _SCHEMA_NOTES sentence is gone.
* The repair scenario repairs a REALLY broken query: the task carries SQL
  referencing thelook_ecommerce.order (not orders); the REAL dry-run
  rejects it, the repair step fixes it from the actual BigQuery error, and
  run_query returns real rows (live-verified: China $364K...). The
  simulated flaky_dry_run capability and _FLAKY_CALLS are deleted —
  transient-failure simulation now lives only in the CI test stub.
* render_chart: a bare tournament winner now charts REAL revenue-by-region
  rows fetched from BigQuery (canned rows remain only inside the
  no-credential mock engine).
* Chart/loop/branch/tolerance tests pin the mock engine explicitly (no
  network in unit tests); the branch e2e stubs the now-LLM describe_schema.

The only non-real component left BY DESIGN is the no-credential fallback
warehouse, always labeled via the engine field. 38 CA tests collected; 91
green locally (92 under the patched ADK tree).
…the stray)

The console's Agents Hub shows 8 tables; the demo's hand-written catalogue
listed 7. The 8th is real: 'thelook_ecommerce-table', an empty stray
placeholder (0 rows, dashed name) in the public dataset. The profiling
fan-out now discovers the table list live from __TABLES__ (cached per
process; curated-catalogue fallback without credentials), so it always
matches what the console shows — and surfacing the empty stray is itself
an honest data-quality finding. Profile-name validation now permits dashed
table names. 39 CA tests collected.
Matches the production CA agent's 7-table scope: the 0-row
'thelook_ecommerce-table' placeholder is excluded from profiling fan-outs.
…/run/chart)

Found via the production-CA head-to-head: the shipped pipeline scenario
only translated+validated panels; the production agent executed them. Our
vocabulary already expresses the full shape — pipeline stages
draft_or_repair_sql -> REAL dry_run -> run_query -> render_chart, per
panel, barrier-free. Live result: 3 real panels (12-row revenue line,
top-5 categories bar, 60-row multi-series users-by-traffic-source line) in
10.0s / 12 dispatches against the real dataset. Panel questions updated to
concrete 2025 asks; e2e pins 3 executed chart artifacts.
…visions

The CA head-to-head showed our determinism was plan-level only — the
drafting LLM still varied date windows run-to-run. SQL freezing extends
the freeze one level deeper, and human feedback makes the artifact
governable:

* After a question's SQL passes the REAL dry-run, it is frozen to
  ca_plan_store/sql/<question-digest>.json (sql, sha256, engine, bytes,
  validated_at, revisions[]). Re-asking the exact question replays via a
  STATIC replay plan (dry_run -> run_query -> chart -> summarize; constant
  hash, no drafting step): the dry-run re-validates (warehouse-drift
  detection) and the numbers are deterministic. Live-verified: identical
  $644,971.22 across runs with the drafting LLM skipped.
* 'revise: <feedback>' revises the frozen SQL for the session's last
  question (draft capability is now feedback-aware), MUST pass the real
  dry-run before replacing the artifact, and records the feedback +
  previous SQL in the revisions history — an auditable trail of who
  changed the query and why. Live-verified: 'exclude Cancelled/Returned'
  -> validated revision -> re-frozen -> executed (China $644K -> $480K).
* Drift safety: a frozen SQL that no longer validates falls back to fresh
  authoring with the rejection shown.

Tests: store roundtrip + normalized digests + revision history, static
replay plan (constant hash, lint-clean, no drafting capability), e2e
replay with 4 dispatches (no draft), revise-trigger routing. 95 total.
…istory

The 🧊 cards show the pinned statement (replays skip the drafting LLM),
sql_hash/validated_at/engine, and every human-feedback revision with the
preserved previous SQL — the numeric-determinism and governance benefits
made visible alongside the plan envelopes.
plan_inspector.py now takes an optional session id (plus app/user) and
reads the actual session from the running ADK server: each user turn is
rendered as a timeline card classified by the mechanism that answered it
(frozen-SQL replay / human revision / authored fresh / frozen-plan reuse /
conversation), with the sql hash, applied revision count, and the
resulting insight — so the page tells the story of the demo run on screen,
with the artifacts those turns created and reused directly below.

  python .../plan_inspector.py <session-id> [app] [user]
…frozen middle results

The page now pitches RFC google#93 with the demo as evidence, not the demo
itself. The model-authored typed plan is the centerpiece; the validated
intermediate results its steps produce (the dry-run-checked SQL, in this
instance) render INSIDE the workflow card, attached to the step that
produced them — the drafting loop carries a 'result frozen, SKIPPED on
replay' badge. Session timeline beats use RFC vocabulary (model authored
the workflow / frozen-workflow replay / step-result replay /
human-governed revision), and a closing card generalizes the mechanism
beyond SQL (retrieved schemas, verified claims, chart specs) into the
RFC's freezing tiers v1 / v1.1 / v1.2 / v2.
Alignment check against a live session showed the page mixed THIS
session's frozen middle result with artifacts from other sessions (the
store is global by design). With a session id, middle results now filter
to questions actually asked (or revised) in that session; artifacts from
other sessions collapse into a labeled details card — the page mirrors
the demo run on screen while still showing the cross-session store.
User question exposed the gap: middle results were stored BESIDE the
frozen workflow (linked only by the inspector's heuristic), not
structurally attached to it. The artifact now records plan_hash and
produced_by_step at freeze time; revisions inherit the lineage. Design
kept deliberate: the plan envelope stays a question-agnostic TEMPLATE
(embedding per-question instances would bloat and conflate it) — middle
results are per-task-input instances that REFERENCE their plan. The
inspector attaches artifacts to workflow cards by recorded lineage
(heuristic fallback for pre-lineage records) and renders the lineage on
each card. Lineage pinned by tests incl. revision inheritance.
@caohy1988 caohy1988 force-pushed the spike/dynamic-supervisor-concurrency branch from 918e81c to b41323a Compare June 24, 2026 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant